LEARNING MULTISCALE SPARSE REPRESENTATIONS FOR
IMAGE AND VIDEO RESTORATION

JULIEN MAIRAL, GUILLERMO SAPIRO, AND MICHAEL ELAD


Abstract.  A  framework  for  learning  multiscale  sparse  representations  of  color  images  and  video 
with  overcomplete  dictionaries  is  presented  in  this  paper.  Following  the  single-scale  grayscale  K-SVD 
algorithm  introduced  in  [1],  which  formulates  the  sparse  dictionary  learning  and  image  representation 
as  an  optimization  problem  efficiently  solved  via  orthogonal  matching  pursuit  and  SVD,  this  proposed 
multiscale  learned  representation  is  obtained  based  on  an  efficient  quadtree  decomposition  of  the 
learned  dictionary  and  overlapping  image  patches.  The  proposed  framework  provides  an  alternative 
to  pre-defined  dictionaries  such  as  wavelets,  and  leads  to  state-of-the-art  results  in  a  number  of 
image  and  video  enhancement  and  restoration  applications.  The  presentation  of  the  framework  here 
proposed  is  accompanied  by  numerous  examples  demonstrating  its  practical  power. 

Key words. Image and video processing, sparsity, dictionaries, multiscale representation,
denoising, inpainting, interpolation, learning.

AMS  subject  classifications.  49M27,  62H35 

1. Introduction. Consider a signal $x \in \mathbb{R}^n$. We say that it admits a sparse
approximation over a dictionary $D \in \mathbb{R}^{n \times k}$, composed of $k$ elements referred to as
atoms,  if  one  can  find  a  linear  combination  of  “few”  atoms  from  D  that  is  “close”  to 
the  signal  x.  The  so-called  Sparseland  model  suggests  that  such  dictionaries  exist  for 
various  classes  of  signals,  and  that  the  sparsity  of  a  signal  decomposition  is  a  powerful 
model  in  many  image  and  video  processing  applications  [1,  19,  25,  35]. 

An important assumption, commonly and successfully used in image processing,
is the existence of multiscale features in images. Attempting to design the best
multiscale  dictionary  which  fulfils  a  sparsity  criterion  has  been  a  major  challenge  in 
recent  years.  Such  attempts  include  wavelets  [26],  curvelets  [5,  6],  contourlets  [14,  15], 
wedgelets  [16],  bandlets  [27,  28],  and  steerable  wavelets  [20,  38].  These  methods  lead 
to  many  effective  algorithms  in  image  processing,  e.g.,  image  denoising  [34].  In  this 
paper,  instead  of  designing  the  best  pre-defined  dictionary  for  image  reconstruction, 
we  propose  to  learn  it  from  examples. 

In  [1],  the  K-SVD  algorithm  is  proposed  for  learning  a  single-scale  dictionary  for 
sparse  representation  of  grayscale  image  patches.  By  means  of  a  sparsity  prior  on  all 
fixed-sized  overlapping  patches  in  the  image,  the  K-SVD  is  used  for  removing  white 
Gaussian  noise,  leading  to  a  highly  efficient  algorithm  [19].  This  has  been  extended  to 
color  images,  with  state-of-the-art  results  in  denoising,  inpainting,  and  demosaicing 
applications  [25],  and  more  recently  to  video  denoising  [35].  In  this  paper,  we  extend 
the  basic  K-SVD  work,  providing  a  framework  for  learning  multiscale  and  sparse  image 
representation. In addition to the presentation of the new methodology, we apply it to
various image and video processing tasks, obtaining results that outperform previous
works. Our results for denoising grayscale images outperform, for instance, works such
as [9, 18, 19, 21, 34, 37]. The proposed algorithm also competes favorably with the
most recent and state-of-the-art result in this field [11], which is based on the non-local
means algorithm [4]. Our framework for color image denoising also competes
favorably  with  the  best  known  algorithm  in  this  field  [10],  and  the  results  for  the  other 
presented  applications  such  as  color  video  denoising  and  inpainting  of  small  holes  in 
image  and  video,  are  also  among  the  best  we  are  aware  of. 

The  task  of  learning  a  multiscale  dictionary  has  been  addressed  in  [32]  in  the 
general  context  of  sparsifying  image  content.  Our  approach  differs  from  this  work 
in many ways, including: (i) their training algorithm employs a simple steepest descent
while ours uses more effective iterations, thus leading to faster convergence; (ii)
the structure of the multiscale process; and (iii) the way the found dictionaries are
deployed for denoising is entirely different, as we base our algorithm on the energy
minimization method introduced in [19]. This explains the significantly superior performance
we obtain. Other results on learning single-scale image dictionaries include
for  example  [36,  37,  42]. 

The  structure  of  this  paper  is  as  follows:  In  Section  2,  we  briefly  review  relevant 
background: the original K-SVD denoising algorithm [1], the extensions to color
image denoising, non-homogeneous noise, and inpainting [25], and the K-SVD for
denoising videos [35]. Section 3 is devoted to the presentation of our multiscale scheme,
and  this  section  is  followed  by  two  sections  that  introduce  important  algorithmic 
improvements  to  the  original  single-scale  K-SVD.  Section  6  presents  some  applications 
of  the  multiscale  K-SVD,  covering  grayscale  and  color  image  denoising  and  image 
inpainting.  In  Section  7  we  show  the  performance  for  video  processing.  Section 
8  concludes  this  paper  with  a  brief  description  of  its  contributions  and  some  open 
questions  for  future  work. 

2.  The  single-scale  K-SVD.  The  single-scale  K-SVD  has  already  shown  very 
good  performance  for  grayscale  image  denoising  [18,  19],  color  image  denoising  [25], 
inpainting  and  demosaicing  [25],  and  video  denoising  [35].  In  this  section,  we  briefly 
review  these  algorithms. 

2.1.  The  grayscale  image  denoising  K-SVD  algorithm.  We  now  briefly 
review  the  main  ideas  of  the  K-SVD  framework  for  sparse  image  representation  and 
denoising.  The  reader  is  referred  to  [18,  19]  for  additional  details. 

Let $x_0$ be a clean image and $y = x_0 + w$ its noisy version, with $w$ being an additive
zero-mean white Gaussian noise with a known standard deviation $\sigma$. The algorithm
aims at finding a sparse approximation of every $\sqrt{n} \times \sqrt{n}$ overlapping patch of $y$,
where $n$ is fixed a priori. This representation is done over an adapted dictionary $D$,
learned for this set of patches. These approximations of patches are averaged to obtain
the reconstructed image. This algorithm (shown in Figure 2.1) can be described as
the minimization of an energy:

$$\{\hat{\alpha}_{ij}, \hat{D}, \hat{x}\} = \arg\min_{D, \alpha_{ij}, x} \; \lambda \|x - y\|_2^2 + \sum_{ij} \mu_{ij} \|\alpha_{ij}\|_0 + \sum_{ij} \|D\alpha_{ij} - R_{ij} x\|_2^2. \tag{2.1}$$

In this equation, $\hat{x}$ is the estimator of $x_0$, and the dictionary $\hat{D} \in \mathbb{R}^{n \times k}$ is an estimator
of the optimal dictionary, which leads to the sparsest representation of the patches
in the recovered image. The indices $[i,j]$ mark the location of the patch in the image
(representing its top-left corner). The vector $\hat{\alpha}_{ij} \in \mathbb{R}^k$ is the sparse representation
for the $[i,j]$-th patch in $\hat{x}$ using the dictionary $\hat{D}$. The notation $\|\cdot\|_0$ is the $\ell^0$ quasi-norm,
a sparsity measure, which counts the number of non-zero elements in a vector.
The operator $R_{ij}$ is a binary matrix which extracts the square $\sqrt{n} \times \sqrt{n}$ patch of
coordinates $[i,j]$ from the image written as a column vector. The main steps of the
algorithm are (refer to Figure 2.1):

• Sparse Coding Step: This is performed with an Orthogonal Matching Pursuit
(OMP) [12, 13, 29], which proves to be very efficient for diverse approximation
problems [17, 40, 41]. The approximation stops when the residual reaches a
sphere of radius $\sqrt{n}\,C\sigma$ representing the probability distribution of the noise
($C$ being a constant). More on this can be found in [25].

•  Dictionary  Update:  This  is  a  sequence  of  one-rank  approximation  problems 
that  update  both  the  dictionary  atoms  and  the  sparse  representations  that 
use  it,  one  at  a  time. 

• Reconstruction: The last step is a simple averaging between the patches’
approximations and the noisy image. The denoised image is $\hat{x}$. Equation
(2.4) emerges directly from the energy minimization in Equation (2.1).


Parameters: $\lambda$ (Lagrange multiplier); $C$ (noise gain); $J$ (number of iterations);
$k$ (number of atoms); $n$ (size of the patches).

Initialization: Set $x = y$; initialize $D = (d_l \in \mathbb{R}^{n \times 1})_{l \in 1 \ldots k}$ (e.g., redundant
DCT).

Loop: Repeat $J$ times

• Sparse Coding: Fix $D$ and use OMP to compute coefficients $\alpha_{ij} \in \mathbb{R}^{k \times 1}$
for each patch by solving:

$$\forall ij \quad \alpha_{ij} = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{subject to} \quad \|R_{ij} x - D\alpha\|_2^2 \le n (C\sigma)^2. \tag{2.2}$$

• Dictionary Update: Fix all $\alpha_{ij}$, and for each atom $d_l$, $l \in 1, 2, \ldots, k$ in $D$,

- Select the set of patches $\omega_l$ that use this atom,

$$\omega_l := \{[i,j] \mid \alpha_{ij}(l) \neq 0\}.$$

- For each patch $[i,j] \in \omega_l$, compute its residual:

$$e_{ij}^l = R_{ij} x - D\alpha_{ij} + d_l \alpha_{ij}(l).$$

- Set $E_l$ as the matrix whose columns are the $e_{ij}^l$, and $\alpha^l$ the vector
whose elements are the $\alpha_{ij}(l)$.

- Update $d_l$ and the $\alpha_{ij}(l)$ by minimizing:

$$(d_l, \alpha^l) = \arg\min_{\alpha, \|d\|_2 = 1} \|E_l - d\alpha^T\|_F^2. \tag{2.3}$$

This one-rank approximation is performed by a truncated SVD of $E_l$.

• Reconstruction: Perform a weighted average:

$$\hat{x} = \Big(\lambda I + \sum_{ij} R_{ij}^T R_{ij}\Big)^{-1} \Big(\lambda y + \sum_{ij} R_{ij}^T D\alpha_{ij}\Big). \tag{2.4}$$

Fig.  2.1.  The  single-scale  K-SVD-based  grayscale  image  denoising  algorithm. 
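
As an illustration only, the Sparse Coding and Dictionary Update steps of Figure 2.1 can be sketched in NumPy as follows. This is not the authors' implementation: the function names, the random initial dictionary (Figure 2.1 suggests a redundant DCT instead), the default value of `C`, and the convention that the patches $R_{ij}x$ are stored as the columns of a matrix `Y` are all assumptions of this sketch, and the averaging step (2.4) is omitted.

```python
import numpy as np

def omp_error(D, y, eps2):
    """Greedy OMP: add atoms until ||y - D a||_2^2 <= eps2, as in Eq. (2.2)."""
    n, k = D.shape
    support, coef = [], np.zeros(0)
    residual = y.astype(float)
    while residual @ residual > eps2 and len(support) < min(n, k):
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = 0.0          # never select the same atom twice
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    a = np.zeros(k)
    if support:
        a[support] = coef
    return a

def ksvd_update(D, A, Y):
    """One sweep of one-rank updates over all atoms, as in Eq. (2.3)."""
    for l in range(D.shape[1]):
        users = np.flatnonzero(A[l, :])  # omega_l: patches that use atom l
        if users.size == 0:
            continue
        # Residual of those patches with the contribution of atom l added back.
        E = Y[:, users] - D @ A[:, users] + np.outer(D[:, l], A[l, users])
        U, S, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, l] = U[:, 0]                # unit-norm atom
        A[l, users] = S[0] * Vt[0, :]    # updated coefficients
    return D, A

def ksvd(Y, k, sigma, C=1.15, J=5, seed=0):
    """Alternate sparse coding and dictionary update J times (C is illustrative)."""
    n, M = Y.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n, k))
    D /= np.linalg.norm(D, axis=0)       # assumed random, unit-norm initialization
    eps2 = n * (C * sigma) ** 2
    for _ in range(J):
        A = np.column_stack([omp_error(D, Y[:, j], eps2) for j in range(M)])
        D, A = ksvd_update(D, A, Y)
    return D, A
```

Each atom is replaced by the leading left singular vector of its residual matrix, so the atoms stay unit-norm without an explicit renormalization.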

When  it  was  designed,  this  algorithm  provided  state-of-the-art  results.  One  of 
its main contributions was its ability to learn a dictionary on a large database
of images (the so-called global approach), thereby exploiting intrinsic information
of natural images, or to learn a dictionary over all the overlapping patches of one
image (adaptive approach), exploiting gathered information from the whole image
at  a  specific  location.  As  in  [25],  we  typically  learn  off-line  a  global  dictionary  and 
then  use  it  as  an  initial  dictionary  at  the  beginning  of  the  iterative  adaptive  approach 
presented  in  Figure  2.1. 

2.2. The color image denoising extension. In [25], we showed that we can
apply the K-SVD to color images by denoising each RGB patch directly as a long
concatenated  RGB  vector.  Within  this  framework,  the  algorithm  is  able  to  learn  the 
correlation  between  the  RGB  channels,  and  exploit  it  effectively.  This  was  shown  to 
provide  improved  results  over  the  denoising  of  each  color  channel  separately. 

Nevertheless,  we  observed  on  some  images  a  color  bias,  and  especially  so  when 
we  used  the  global  dictionaries.  Our  study  has  shown  that  this  phenomenon  happens 
because  the  dictionary  redundancy  was  too  small  to  represent  the  diversity  of  colors 
among natural images. We therefore used a different metric during the orthogonal
matching pursuit that maintains the average color of the original image. With this
intention, we introduced a new parameter $\gamma$ that enforces the average color of the
patches. Additional details and numerous examples are given in [25], showing that the
proposed  framework  leads  to  state-of-the-art  results,  which  are  further  improved  with 
the  multiscale  approach  and  additional  algorithmic  improvements  here  introduced. 

2.3. Handling non-homogeneous noise and inpainting. The problem of
handling non-homogeneous noise is very important, as non-uniform noise across color
channels is very common in digital cameras. In [25], we presented a variation of the
K-SVD that addresses this issue. Within the limits of this model, we were
able to fill in relatively small holes in the images, and we presented state-of-the-art
results for image demosaicing, outperforming every specialized interpolation-based
method, such as [7, 22, 23, 31].

Let us now consider the case where $w$ is a white Gaussian noise with a different
standard deviation $\sigma_p > 0$ at each location $p$. Assuming these standard deviations
are known, we introduced a vector $\beta$ composed of weights for each location:

$$\beta_p = \frac{\min_{p' \in \text{Image}} \sigma_{p'}}{\sigma_p}.$$

This leads us to define a weighted K-SVD algorithm based on a different metric
for each patch. Since the fine details of this approach are given in [25], we restrict the
discussion here to a rough coverage of the main idea. Denoting by $\otimes$ the element-wise
multiplication between two vectors, we aim at solving the following problem, which
replaces Equation (2.1):

$$\{\hat{\alpha}_{ij}, \hat{D}, \hat{x}\} = \arg\min_{D, \alpha_{ij}, x} \; \lambda \|\beta \otimes \beta \otimes (x - y)\|_2^2 + \sum_{ij} \mu_{ij} \|\alpha_{ij}\|_0 + \sum_{ij} \|(R_{ij}\beta) \otimes (D\alpha_{ij} - R_{ij} x)\|_2^2. \tag{2.5}$$

As explained in [25], there are two main modifications in the minimization of this
energy. First, the Sparse Coding step takes the weights $\beta$ into account by using a
different metric within the OMP. Second, the Dictionary Update variation is more
delicate and Equation (2.3) is replaced by

$$(d_l, \alpha^l) = \arg\min_{\alpha, \|d\|_2 = 1} \|\beta_l \otimes (E_l - d\alpha^T)\|_F^2, \tag{2.6}$$

where $\beta_l$ is the matrix whose size is the same as $E_l$ and where each column corresponding
to an index $[i,j]$ is $R_{ij}\beta$. This problem is known as a weighted one-rank
matrix approximation (see [39]). The algorithm that we used to approximate a solution
is presented in Figure 2.2.


Input: $E$ ($n \times m$ matrix), $\beta$ ($n \times m$ matrix of weights), iter (number of iterations,
typically 5).

Output: $d_{\text{iter}}$ ($n \times 1$ vector), $\alpha_{\text{iter}}$ ($m \times 1$ vector).

Problem to solve:

$$\arg\min_{\alpha, \|d\|_2 = 1} \|\beta \otimes (E - d\alpha^T)\|_F^2.$$

Initialization: $d_0 = 0$, $\alpha_0 = 0$.

Loop: For $i$ from 1 to iter, use a truncated SVD to solve:

$$(d_i, \alpha_i) = \arg\min_{\alpha, \|d\|_2 = 1} \|\beta \otimes E + (1_{n \times m} - \beta) \otimes d_{i-1}\alpha_{i-1}^T - d\alpha^T\|_F^2,$$

where $1_{n \times m}$ is an $n \times m$ matrix filled with ones.


Fig.  2.2.  A  weighted  one-rank  approximation  algorithm. 
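
The iteration of Figure 2.2 translates almost line by line into NumPy. The sketch below is a minimal illustration under the assumption that all weights lie in $[0, 1]$ (which holds for the weights defined in this section); the function name `weighted_rank_one` and its argument names are choices made here, not part of the original algorithm description.

```python
import numpy as np

def weighted_rank_one(E, beta, n_iter=5):
    """Approximate argmin_{a, ||d||_2 = 1} ||beta * (E - d a^T)||_F^2 (Fig. 2.2)."""
    n, m = E.shape
    d, a = np.zeros(n), np.zeros(m)      # d_0 = 0, alpha_0 = 0
    for _ in range(n_iter):
        # Entries with low weight are filled in with the current estimate.
        target = beta * E + (1.0 - beta) * np.outer(d, a)
        U, S, Vt = np.linalg.svd(target, full_matrices=False)
        d, a = U[:, 0], S[0] * Vt[0, :]  # best unweighted one-rank fit
    return d, a
```

When every weight equals one, the first iteration already returns the ordinary truncated SVD of $E$, so the unweighted update of Equation (2.3) is recovered as a special case.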

Inpainting, e.g., [2, 8], consists of filling in holes in images. Within the limits of our
model, it becomes possible to address a particular case of inpainting. By considering
small random holes as areas with infinite power noise, our weighted K-SVD algorithm
proved to be very efficient. This inpainting case could also be considered as a specific
case of interpolation. The mathematical formulation from Equation (2.5) remains the
same, but some values of $\beta$ are simply equal to zero. Details about this
are  provided  in  [25],  together  with  a  discussion  on  how  to  handle  the  demosaicing 
problem  that  has  a  fixed  and  periodic  pattern  of  missing  values. 
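
The link between the weights of this section and inpainting can be made concrete with a short sketch. The helper name `noise_weights` is hypothetical; the point is only that a pixel whose noise level is set to infinity receives weight zero, which is how a hole is encoded.

```python
import numpy as np

def noise_weights(sigma):
    """beta_p = (min over the image of sigma) / sigma_p; infinite sigma gives 0."""
    sigma = np.asarray(sigma, dtype=float)
    return sigma.min() / sigma

# Two noisy pixels and one missing pixel (treated as infinite-power noise).
beta = noise_weights([10.0, 20.0, np.inf])
```

Here the least noisy pixel gets weight 1, the pixel with twice the noise gets weight 0.5, and the missing pixel gets weight 0.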

2.4.  The  video  denoising  algorithm.  The  video  extension  of  the  K-SVD  has 
been  developed  and  described  in  [35].  This  work  exploits  the  temporal  correlation 
in  video  signals  to  increase  the  denoising  performance  of  the  algorithm,  providing 
state-of-the-art  results  for  removing  white  Gaussian  noise. 

As  explained  in  [35],  applying  the  previously  described  K-SVD  on  the  whole 
video  volume  as  one  signal  is  problematic  due  to  the  rapid  changes  in  the  video 
content,  implying  that  one  dictionary  will  not  be  able  to  fit  well  to  the  whole  data. 
At  the  other  extreme,  an  alternative  method  that  applies  the  K-SVD  single-image 
denoising  algorithm,  as  described  above,  to  the  image  sequence  is  also  expected  to 
perform  poorly,  since  we  do  not  exploit  the  temporal  correlation.  Therefore,  a  different 
approach  is  proposed  in  [35],  based  on  the  following  concepts: 

• 3D Atoms: Each frame should be denoised separately, but patches are constructed
from more than one frame, grasping both spatial and temporal behaviors.

•  Dictionary  propagation:  The  initial  dictionary  for  each  frame  is  the  one 
trained  for  the  previous  one.  Fewer  training  iterations  are  thus  required. 

•  Extended  temporal  set  of  patches:  Patches  in  neighboring  frames  are  also  used 
for  dictionary  training  and  image  cleaning  for  each  frame. 

Translating these three concepts into a mathematical formulation leads to the following
modified version of Equation (2.1) (the reader should refer to [35] for additional
details):

$$\forall t \in 1 \ldots T, \quad \{\hat{\alpha}_{ijk}, \hat{D}_t, \hat{x}_t\} = \arg\min_{D_t, \alpha_{ijk}, x_t} \; \lambda \|x_t - y_t\|_2^2 + \sum_{ij} \sum_{k=t-\Delta t}^{t+\Delta t} \mu_{ijk} \|\alpha_{ijk}\|_0 + \|D_t\alpha_{ijk} - R_{ijk} x\|_2^2. \tag{2.7}$$

Having concluded the brief background presentation, we proceed to present a
multiscale framework that improves all of the above-mentioned algorithms.
We should note that the approach we are about to present is one among several possibilities
for introducing multiscale analysis into the dictionary learning and sparse
image/video representation framework. This means that further work could (and
should) be done to explore alternative possibilities, in spite of the fact that the approach
presented here already leads to state-of-the-art results.

3.  The  multiscale  sparse  representation.  Since  it  is  well  accepted  that  image 
information  spreads  across  multiple  scales,  designing  a  K-SVD  type  of  algorithm  that 
is  able  to  adapt  and  capture  information  at  multiple  scales  is  the  main  goal  of  this 
paper.  This  section  discusses  the  main  principles  of  our  proposed  approach. 

One simple and naive strategy to introduce multiscale analysis consists of using
large patches with a high redundancy factor ($k/n$), and hoping for the appearance of
intrinsic multiple scales among the learned dictionary’s atoms. However, we have
observed no significant differences between the results with the parameters $\{n = 8 \times 8, k = 256\}$
compared to $\{n = 16 \times 16, k = 1024\}$. A number of reasons might explain
the  “failure”  of  this  direct  approach.  First,  it  might  be  that  for  low  dimensions  (small 
n)  there  is  no  need  for  multiscale  structure  for  representation  and  denoising,  becoming 
more  crucial  only  as  the  dimension  grows.  In  that  respect,  16  x  16  blocks  might  not  be 
enough for the original K-SVD algorithm to show the multiscale structure. Another
explanation is that the K-SVD may be trapped in a local minimum, missing the true
multiscale result. By explicitly imposing such multiscale structure, we may help in this
regard.  This  leads  us  naturally  to  the  proposed  framework.  We  note  that  although 
we  present  a  multiscale  extension  of  the  K-SVD  for  image  and  video  enhancement, 
learning  multiscale  dictionaries  is  important  per  se,  also  for  other  applications. 

3.1.  The  basic  model.  In  our  proposed  multiscale  framework,  we  focus  on  the 
use  of  different  sizes  of  atoms  simultaneously.  Considering  the  design  of  a  patch-based 
representation  and  a  denoising  framework,  we  put  forward  a  simple  quadtree  model 
of large patches as shown in Figure 3.1. This is a classical data structure, also used
in wedgelets for example [16]. A fixed number of scales, $N$, is chosen, such that it
corresponds to $N$ different sizes of atoms. A large patch of size $n$ pixels is divided
along the tree into sub-patches of sizes $n_s = n/4^s$, where $s = 0 \ldots N-1$ is the depth in
the tree. Then, one different dictionary $D_s$ composed of $k_s$ atoms of size $n_s$ is learned
and used for each scale $s$.
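
The quadtree division can be illustrated with a few lines of NumPy. The sketch below assumes a square grayscale patch stored as a 2D array and a row-major ordering of the positions within a scale; both the function name `subpatches` and that ordering are conventions chosen here.

```python
import numpy as np

def subpatches(patch, s):
    """Return the 4**s sub-patches of a square patch at quadtree depth s."""
    side = patch.shape[0]
    cells = 2 ** s                       # number of cells per side at depth s
    if side % cells:
        raise ValueError("patch side must be divisible by 2**s")
    w = side // cells                    # sub-patch side, so that n_s = n / 4**s
    return [patch[r * w:(r + 1) * w, c * w:(c + 1) * w]
            for r in range(cells) for c in range(cells)]
```

For a $20 \times 20$ patch and $N = 3$ scales, depth 0 returns the patch itself, depth 1 returns four $10 \times 10$ sub-patches, and depth 2 returns sixteen $5 \times 5$ sub-patches.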

The overall idea of the multiscale algorithm we propose stays as close as possible
to the original K-SVD algorithm, Figure 2.1, with an attempt to exploit the several
existing scales. This aims at solving the same energy minimization problem of Equation
(2.1), with a multiscale structure embedded within the dictionary $D$, which is
a joint one, composed of all the atoms of all the dictionaries $D_s$ located at every
possible position in the quadtree. For the scale $s$, there exist $4^s$ such positions. This
makes a total of $\sum_{s=0}^{N-1} 4^s k_s$ atoms in $D$. This is illustrated in Figure 3.2, where an
example of a possible multiscale decomposition is presented.

Fig. 3.1. Quadtree model selected for the proposed multiscale framework.

Fig. 3.2. Possible decomposition of a $20 \times 20$ patch with a 3-scales dictionary.

Addressing the minimization problem of Equation (2.1) with a multiscale dictionary
$D$ requires treating the atoms from the different scales equally. Therefore, we
chose to normalize all the atoms of the dictionaries to unit norm. This policy is important
during the Sparse Coding step and, in our experiments, proved to provide better results
than choosing a different norm for each scale.

The original K-SVD exploits the overlapping/shift-invariant treatment of the
patches’ representation, which has been found to be critical for denoising [18, 19,
25, 37]. One characteristic of our multiscale model is that it makes it possible to force and
exploit this overlapping/shift-invariant sparsity at each scale: The use of the quadtree
does not allow for all possible shifts for the sub-patches inside one large patch,
permitting only $4^s$ different shifts at the scale $s$ for a sub-patch. This prevents them from
constantly adapting their position to a noisy pattern and thereby learning it.

Integrating  the  multiscale  structure  requires  the  following  key  modifications  to 
the  basic  algorithm: 

• Sparse Coding: This remains unchanged if we introduce some simple notation.
In Equation (2.2), assume that $R_{ij}$ remains the matrix that extracts the
patch of size $n_0 = n$ with coordinates $[i,j]$. The multiscale dictionary $D$ is
the joint one, composed of all the atoms of all the dictionaries
$D_s = (d_{sl} \in \mathbb{R}^{n_s \times 1})_{l \in 1 \ldots k_s}$ located at every possible position in the quadtree structure. For
the scale $s$, we denote their index as $p$ among the $4^s$ possible shifts. The
OMP is implemented efficiently using a Modified Gram-Schmidt algorithm
[3]. During each selection procedure of the OMP, a scale $s$, a position $p$,
and an atom $d_{sl}$ are chosen. For each patch, this step can be achieved in
$O\big((\sum_{s=0}^{N-1} k_s)\, n \|\alpha_{ij}\|_0\big)$ operations.




• Dictionary Update: This step is slightly changed, as we update each atom
$d_{sl}$ ($1 \le l \le k_s$) in each scale (from $s = 0$ to $s = N-1$):

- Select the set of sub-patches from the scale $s$ that use the $l$-th atom,

$$\omega_{sl} := \{[i,j,s,p] \mid \alpha_{ij}(s,l,p) \neq 0\},$$

where $[i,j,s,p]$ denotes the sub-patch at the scale $s$ and position $p$ from
the patch $[i,j]$, and $\alpha_{ij}(s,l,p)$ is the coefficient corresponding to this
sub-patch and the atom $d_{sl}$.

- For each sub-patch $[i,j,s,p] \in \omega_{sl}$, compute

$$e_{ijsp}^{sl} = T_{sp}(R_{ij} x - D\alpha_{ij}) + d_{sl}\alpha_{ij}(s,l,p),$$

where $T_{sp} \in \{0,1\}^{n_s \times n_0}$ is a binary matrix which extracts the sub-patch
$[i,j,s,p]$ from a patch $[i,j]$.

- Set $E_{sl}$ as the matrix whose columns are the $e_{ijsp}^{sl}$, and $\alpha^{sl}$ the vector
whose elements are the $\alpha_{ij}(s,l,p)$.

- Update $d_{sl}$ and the $\alpha_{ij}(s,l,p)$ using an SVD as before:

$$(d_{sl}, \alpha^{sl}) = \arg\min_{\alpha, \|d\|_2 = 1} \|E_{sl} - d\alpha^T\|_F^2. \tag{3.1}$$

• Reconstruction: Remains the same as in Equation (2.4), while using the
new notation just introduced. Note that each patch is reconstructed from
multiple scales, and since a pixel belongs to multiple (overlapping) patches,
it is reconstructed with multiple scales and at multiple positions.

The computational time of the Sparse Coding stage dominates that of the
Dictionary Update and the Reconstruction stages. The total complexity is therefore
$O\big((\sum_{s=0}^{N-1} k_s)\, n L J M\big)$, where $L$ is the average sparsity factor (number of coefficients
obtained in the decomposition), and $M$ is the number of patches processed.
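
To make the joint dictionary concrete, the following sketch assembles it explicitly by embedding each atom of each $D_s$, zero-padded, at every one of its $4^s$ quadtree positions. It assumes square grayscale patches flattened in row-major order, and the function name `joint_dictionary` is hypothetical. Storing the joint dictionary densely like this is only for illustration; it is not meant to reflect how the sparse coding is implemented in practice.

```python
import numpy as np

def joint_dictionary(dicts, side):
    """Embed every atom of every D_s at each of its 4**s quadtree positions."""
    cols = []
    for s, Ds in enumerate(dicts):       # Ds has shape (n_s, k_s)
        cells = 2 ** s
        w = side // cells                # side of a sub-patch at scale s
        for r in range(cells):
            for c in range(cells):
                for atom in Ds.T:
                    big = np.zeros((side, side))
                    big[r * w:(r + 1) * w, c * w:(c + 1) * w] = atom.reshape(w, w)
                    cols.append(big.ravel())
    return np.column_stack(cols)
```

With unit-norm atoms in every $D_s$, the zero-padded columns are also unit-norm, and the number of columns equals $\sum_{s} 4^s k_s$.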

3.2. Extension to various image and video enhancement tasks. We now
show how this framework is extended to different applications.

• Color image denoising: As in the case of the single-scale algorithm, extending
the color framework to the multiscale version requires considering a concatenated
RGB vector. Then, the same quadtree structure and the same scheme
are applied. The only difference with respect to the grayscale algorithm is the use
of the parameter $\gamma$, which was introduced to solve the color-bias problem
that we described in [25]. We recall that it enforces the average color of the
patches during the OMP, thereby creating a new metric. In the multiscale
case, we cannot enforce the average color of a patch, since it would introduce
a bias for the sub-patches in the quadtree. Therefore, this enforcement of the
average color should be done only for the smallest sub-patches. For instance,
assume we have $N = 3$ scales, and a patch of size $n = 20 \times 20 \times 3$. Then, we
enforce the average color of each $5 \times 5 \times 3$ sub-patch within the dictionaries
and patches.

• Non-homogeneous denoising and inpainting: For extending the non-homogeneous
denoising and inpainting algorithm to the multiscale version, one should
first notice that for a patch of index $[i,j]$, the weights $R_{ij}\beta$ that we introduced
in Section 2.3 can be used directly during the Sparse Coding stage, since it
operates as a single-scale one with a large dictionary. Then, the Dictionary
Update step requires a decomposition of $R_{ij}\beta$ in a quadtree structure, providing
a set of weights $T_{sp}R_{ij}\beta$ for each scale $s$ and position $p$ within the
scale. Equation (3.1) has to be adapted to match Equation (2.6):

$$(d_{sl}, \alpha^{sl}) = \arg\min_{\alpha, \|d\|_2 = 1} \|\beta_{sl} \otimes (E_{sl} - d\alpha^T)\|_F^2, \tag{3.2}$$

where $\beta_{sl}$ is a matrix whose size is the same as $E_{sl}$ and where each column
corresponding to an index $[i,j]$ and position $p$ within the scale $s$ is $T_{sp}R_{ij}\beta$.
Again, the algorithm from Figure 2.2 remains relevant.

• Video denoising: We now show how to combine the multiscale color K-SVD
algorithm and the video one. Both use patches and atoms with many
channels: the R, G, B layers for the color processing, and temporal frames
for the video processing (4D atoms). Concatenating the color channels and
different frames in single vectors makes it possible to address the color video denoising
problem by minimizing the same energy as in Equation (2.7). For the multiscale
extension for denoising image sequences, one can see the K-SVD for
video as successive K-SVD for images applied to multi-channel images. It
consists of putting the quadtree structure on the considered channel
images, and considering the learning of the dictionaries at each scale,
proceeding exactly as was done for the single-image case. Here, extending
the grayscale video denoising to color consists of handling concatenated RGB
vectors. Interestingly, we found that when handling a video, we can omit the
use of the warped metric that uses the parameter $\gamma$.

• Non-homogeneous denoising and inpainting for video: Using the same weights
$\beta_t$ introduced for the weighted K-SVD algorithm for the frame at time $t$, the
video inpainting problem can be treated as suffering from non-homogeneous
noise. This leads to the following energy minimization formulation:

$$\forall t \in 1 \ldots T, \quad \{\hat{\alpha}_{ijk}, \hat{D}_t, \hat{x}_t\} = \arg\min_{D_t, \alpha_{ijk}, x_t} \; \lambda \|\beta_t \otimes \beta_t \otimes (x_t - y_t)\|_2^2 + \sum_{ij} \sum_{k=t-\Delta t}^{t+\Delta t} \mu_{ijk} \|\alpha_{ijk}\|_0 + \|(R_{ij}\beta_k) \otimes (D_t\alpha_{ijk} - R_{ijk} x)\|_2^2.$$

Handling the inpainting problem for video via an extension of the previous
algorithm is possible since we can regard the processing of each frame as
separate, although involving adjacent frames. This makes it possible to use the weights
$\beta$ in exactly the same way as we already did for single images. This handles inpainting
of relatively small holes. For addressing the general video inpainting
problem, one should refer to [33, 43]. Due to the multiscale nature of the proposed
scheme, somewhat larger holes can be treated successfully, compared
to the single-scale algorithm.

4. Additional algorithmic improvements. We now introduce important additional
refinements, which further improve the results without increasing the computational
cost.

First, for the grayscale K-SVD we find it useful to force the presence of a constant
(DC) atom in each dictionary, and to give it a preference by multiplying this atom
by a constant $\eta$ (2.5 for example) during the selection procedure of the OMP (refer
to [13]). This makes sense since a constant atom does not introduce any noise in the
reconstruction. For the color extension, we introduced one constant atom per channel,
i.e., one red, one green, and one blue atom, and for the video K-SVD algorithm, one
constant atom (or constant-per-channel in the color case) per frame in the 3D patches.
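
One plausible reading of this preference is that the correlation score of the DC atom is scaled before the maximum is taken in the OMP selection step. The sketch below illustrates that reading only; the function name `select_atom`, the position of the DC atom in column `dc_index`, and the exact place where the scaling is applied are assumptions, not details confirmed by the text.

```python
import numpy as np

def select_atom(D, residual, dc_index=0, eta=2.5):
    """Pick the next OMP atom, boosting the score of the constant (DC) atom."""
    score = np.abs(D.T @ residual)
    score[dc_index] *= eta               # preference given to the DC atom
    return int(np.argmax(score))

# Column 0 is a unit-norm constant atom, column 1 is a competing atom.
D = np.array([[1 / np.sqrt(2), 1.0],
              [1 / np.sqrt(2), 0.0]])
r = np.array([1.0, -0.3])
chosen_with_preference = select_atom(D, r, eta=2.5)
chosen_without = select_atom(D, r, eta=1.0)
```

In this toy case the DC atom is selected when the preference is active, while the competing atom wins when the preference is switched off.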

Secondly, as discussed in [25], the stopping criterion during the OMP is based
on the norm of an n-dimensional Gaussian vector, which is distributed following the
generalized Rayleigh law. This means that one has to stop the approximation when
the residual reaches a fuzzy sphere. According to this law, the bigger n is, the thinner
the sphere is, and the more accurate the stopping criterion $\sqrt{n}\,C(n)\,\sigma$ becomes (C is a
parameter that depends on n). Thus one asset of increasing n through our multiscale
scheme is to provide an improved stopping criterion.
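The thinning of the sphere can be checked numerically. The following sketch (the function name `relative_norm_spread` and the trial counts are illustrative assumptions) draws white Gaussian vectors of standard deviation σ and reports the mean and the standard deviation of their norm divided by $\sqrt{n}\,\sigma$. The ratio concentrates around 1 and its spread shrinks as n grows.

```python
import numpy as np

def relative_norm_spread(n, sigma=15.0, trials=2000, seed=0):
    """Mean and standard deviation of ||w|| / (sqrt(n) * sigma) for w ~ N(0, sigma^2 I_n)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(sigma * rng.standard_normal((trials, n)), axis=1)
    ratio = norms / (np.sqrt(n) * sigma)
    return float(ratio.mean()), float(ratio.std())

for n in (64, 256, 1024):
    mean, spread = relative_norm_spread(n)
    print(f"n = {n:4d}: mean ratio = {mean:.3f}, spread = {spread:.3f}")
```

The spread decreases roughly like $1/\sqrt{2n}$, which is why a threshold defined for a larger n separates noise from signal more sharply.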

It is actually not necessary to perform a complete multiscale algorithm to take
advantage of this property. During the Sparse Coding stage, instead of processing each
patch separately, one can choose to process some sets of adjacent non-overlapping
patches simultaneously and consider them as a larger patch (which is therefore
associated with a better stopping criterion). In practice, we choose m adjacent and
non-overlapping patches of size n, and we first process them independently using their own
stopping criterion $\sqrt{n}\,C(n)\,\sigma$. Then, as long as the cumulative error of the m patches is
larger than the (better) stopping criterion $\sqrt{nm}\,C(nm)\,\sigma$, we refine the approximation
by progressively adding terms, one at a time, to the sparse expansion of the worst
of the m patches. Then we consider a new set of m patches and continue the sparse
approximation. This does not increase the complexity of the algorithm and provides
a noticeable improvement.
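The two-pass procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the "cumulative error" is taken to be the Euclidean norm of the concatenated residuals, the "worst" patch is the one with the largest residual norm, and the values of C(n) and C(nm) are supplied by the caller.

```python
import numpy as np

def add_atom(D, x, support, residual):
    """Add the unused atom most correlated with the residual; return the new residual."""
    scores = np.abs(D.T @ residual)
    if support:
        scores[support] = -np.inf
    support.append(int(np.argmax(scores)))
    coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
    return x - D[:, support] @ coef

def group_sparse_coding(D, patches, sigma, C_n, C_nm):
    """Code m patches independently, then refine the worst one until the group criterion holds."""
    n, k = D.shape
    m = len(patches)
    limit = min(n, k)
    eps_single = np.sqrt(n) * C_n * sigma        # per-patch criterion
    eps_group = np.sqrt(n * m) * C_nm * sigma    # criterion for the m patches together
    supports = [[] for _ in range(m)]
    residuals = [np.asarray(x, dtype=float).copy() for x in patches]

    # Pass 1: each patch is processed with its own stopping criterion.
    for i, x in enumerate(patches):
        while np.linalg.norm(residuals[i]) > eps_single and len(supports[i]) < limit:
            residuals[i] = add_atom(D, x, supports[i], residuals[i])

    # Pass 2: add one term at a time to the patch with the largest residual.
    def group_error():
        return np.sqrt(sum(np.linalg.norm(r) ** 2 for r in residuals))

    while group_error() > eps_group:
        candidates = [i for i in range(m) if len(supports[i]) < limit]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: np.linalg.norm(residuals[i]))
        residuals[worst] = add_atom(D, patches[worst], supports[worst], residuals[worst])
    return supports, residuals
```

Because C(nm) is smaller than C(n) in the reported settings, the group criterion can be violated even when every patch satisfies its own criterion, which is what triggers the second pass.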

Finally, we mention a numerical shortcut that was proposed in [35] and that we found
to be useful. In Figure 2.1, the Sparse Coding and Dictionary Update stages are
performed over the full set of overlapping patches. Performing these steps over a
partial, random subset of these patches during all the iterations (apart from the
last one) leads to a substantial reduction in the computational time and the memory
requirements.
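A minimal sketch of this shortcut is given below, assuming the patches are stored as rows of an array. The function name `training_subset` and its arguments are hypothetical; the subset is redrawn at every iteration and the full set is used for the last one.

```python
import numpy as np

def training_subset(patches, fraction, is_last_iteration, rng):
    """Return the patches used in one learning iteration.

    patches: (num_patches, n) array. fraction: share of patches kept, in (0, 1].
    """
    if is_last_iteration or fraction >= 1.0:
        return patches
    count = max(1, int(round(fraction * len(patches))))
    idx = rng.choice(len(patches), size=count, replace=False)
    return patches[idx]

# Usage: J learning iterations, half of the patches in all but the last one.
rng = np.random.default_rng(0)
all_patches = rng.standard_normal((100, 64))
J = 3
for it in range(J):
    batch = training_subset(all_patches, 0.5, it == J - 1, rng)
    print(it, batch.shape)
```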

5. A block denoising variation. Analyzing the performance of the K-SVD-based
image denoising algorithm raises some interesting questions. Let us consider
a "homogeneous" image that can be represented reliably using one dictionary D_opt.
Then, the bigger the image is, the better the denoising results are, since we get more
examples to train on and the K-SVD is thus more likely to find D_opt. When the
image has a wide variability of content, one could try to use a larger dictionary (with
more redundancy), but our extensive experiments show that this does not significantly
improve the results. This might be explained by the increased risk of getting stuck in a
local minimum in the K-SVD training, or perhaps by the reduced performance
of the OMP, due to the poor coherence properties of the resulting larger dictionary. This is why the
K-SVD has been used so far with a relatively small dictionary of fixed size, which
already provides excellent performance.

Nevertheless, one way of addressing the above-mentioned problem is to handle
different zones of the image separately. In this paper, we chose to define
a simple block denoising algorithm; future work will consist of combining our denoising
algorithm with a segmentation of the input image. More precisely, we
consider blocks of the same size from one image, with a small overlap of the same
width as the patch size, as illustrated in Figure 5.1.1 Given a judiciously adapted
block size, this approach has two advantages. First, the denoising performance
is better, since the dictionaries are better adapted to their
own regions, as can be seen in Figure 5.1. Second, this approach has the same
computational complexity and a lower memory usage, since both are linear in the
number of denoised pixels.2

1 Note that since pixels are recovered as linear combinations of overlapping patches, this will
attenuate the common artifacts at the boundary of the segments.
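The partition into overlapping blocks can be sketched as below. The tiling rule is an assumption made for illustration (regular steps, with the last block shifted so that it reaches the image border), and `block_slices` is a hypothetical helper. Each block would then be denoised with its own dictionary.

```python
import numpy as np

def block_slices(height, width, block, overlap):
    """Return (row_slice, col_slice) pairs covering an image with overlapping square blocks."""
    assert block > overlap >= 0
    step = block - overlap

    def starts(length):
        s = list(range(0, max(length - block, 0) + 1, step))
        if s[-1] + block < length:        # shift a last block to reach the border
            s.append(length - block)
        return s

    return [(slice(r, min(r + block, height)), slice(c, min(c + block, width)))
            for r in starts(height) for c in starts(width)]

# Usage: a 512 x 512 image, 200 x 200 blocks, overlap equal to an 8-pixel patch width.
blocks = block_slices(512, 512, block=200, overlap=8)
coverage = np.zeros((512, 512), dtype=int)
for rows, cols in blocks:
    coverage[rows, cols] += 1
print(len(blocks), coverage.min())
```

With these settings the image is covered by a 3 x 3 grid of blocks, and every pixel belongs to at least one block.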


Fig. 5.1. Illustration of the block denoising algorithm. Four dictionaries trained on four
different blocks (out of the overall 9 that we have) of a noisy image barbara with σ = 15 during a
denoising process are reported in the figure. As can be seen, each dictionary is more adapted to the
content it is serving. E.g., the top left dictionary does not contain textured atoms, as those are not
needed in this part of the image. On the other hand, the bottom right dictionary is practically loaded
with such textured atoms, as those are crucial and dominant in that part of the image.


One natural question that is raised here is whether there exists a generic optimal
size to choose for these blocks. Answering this requires taking several
considerations into account:

• The bigger the blocks are, the more information from the image is taken into
account each time the K-SVD is performed. On the downside, though, bigger
blocks imply more diversity of the image content, and less flexibility of the
dictionary to handle this content well.

• The smaller a block is, the better the K-SVD can adapt the dictionary to it.
However, smaller blocks imply a risk of over-fitting, where the dictionary
learns the given examples and absorbs some of the noise in them as well.

• The bigger σ is, the more patches (and thus bigger blocks) are required to
make the K-SVD robust to the noise.

As expected, our experiments showed that the best size for the block denoising
algorithm was linked to the amount of noise in the image. The smaller the noise variance
is, the smaller the block size that gives, on average, the best denoising performance.

2 Note that we neglect the small increase in the number of pixels due to the small overlapping of
the blocks.

6. Application in image processing. Applying our multiscale scheme to some
image processing tasks proved to noticeably improve the results compared to the
original single-scale algorithm, leading to state-of-the-art results in many image
processing tasks. We now present such results.

6.1. Grayscale image denoising. We now present denoising results obtained
within the proposed multiscale sparsity framework and the algorithmic improvements
that we have introduced. In Table 6.1, our results for N = 1 (single-scale) and N = 2
scales are carefully compared to the original K-SVD algorithm [18, 19] and the recent
results reported in [11].3 The best results are shared between our algorithm and [11].
As can be observed, the differences are insignificant. Our average performance is
better for σ ≤ 10 and for σ = 50, while the results from [11] are slightly better than or
similar to ours for the other noise levels. Tuning the parameters of these
two algorithms more carefully is not expected to change this near-equivalent performance by much.
Our framework is of course a general multiscale representation applicable to numerous
image processing tasks, some of which are demonstrated here.

The  PSNR  values  in  Table  6.1,  corresponding  to  the  results  in  [11,  18,  19]  and 
our  algorithm,  are  averaged  over  5  experiments  for  each  image  and  each  level  of  noise, 
to  cope  with  the  variability  of  the  PSNR  with  the  different  noise  realizations. 
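The averaging protocol can be sketched as follows. The peak value of 255 (8-bit images), the absence of clipping of the noisy image, and the function names are assumptions; `denoise` stands for any denoising routine and is a placeholder, not the algorithm of this paper.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB, assuming a peak value of 255."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def average_psnr(clean, denoise, sigma, runs=5, seed=0):
    """Average the PSNR of denoise(noisy) over several independent noise realizations."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(runs):
        noisy = clean + sigma * rng.standard_normal(clean.shape)
        scores.append(psnr(clean, denoise(noisy)))
    return float(np.mean(scores))

# Usage with a trivial "denoiser" that returns its input unchanged.
clean = np.full((64, 64), 128.0)
print(average_psnr(clean, lambda y: y, sigma=25.0))
```

With the identity in place of a denoiser, the result is close to $20\log_{10}(255/\sigma)$, which gives a baseline against which the gains of an actual denoiser can be read.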

We also compared our results with a very recent paper [24], which is an extension
of [34] with noticeable improvements. In this work, the authors presented some
experiments over a data set that has five images in common with the one we chose (house,
peppers, lena, barbara, boat), and 4 standard deviations for the noise (10, 25, 50, 100).
For very high noise (σ = 100), their algorithm performs better than ours and slightly
better than [11]. Nevertheless, for the other values of noise, we have an improved average
PSNR of 0.20dB over these five images.

During our experiments, the number of atoms k_s for each scale was set to 256, the
parameter λ to $0.45\,n^2/\sigma$, and η, which gives a preference to the constant atom during
the OMP, was set to 2.5. The other parameters used are reported in Table 6.2. The
initial dictionaries are the results of an off-line training on a large generic database
of images [18, 25]. Some of these dictionaries are shown in Figure 6.1. The so-called
sparsity factor L for this off-line training was set to L = 6 for N = 1 and L = 20 for
N = 2, 3.

From these experiments, we draw two conclusions. First, the algorithmic
improvements and the block denoising approach with N = 1 lead to better performance
than the original K-SVD, and this is achieved without increasing the computational
cost. Second, the two-scale algorithm provides a further noticeable improvement over
the single-scale K-SVD, which makes N = 2 a relevant choice, although it introduces
a higher computational cost. A few examples for N = 2 are presented in Figure 6.2.
Using N = 3 scales can provide further improvement at a higher computational cost,
as illustrated in Table 6.3 for σ = 10, 15 and images of size 256 × 256. A visual
comparison between the use of different scales is shown in Figure 6.3. In these images, as
the denoising performance is already very good for one and two scales, the visual
improvements are difficult to observe. Nevertheless, on the zoomed parts of the images,
one can notice that N = 3 provides a more precise brick texture on the image house
and fewer artifacts in the flat areas of the image cameraman.

Some examples of multiscale learned dictionaries are presented in Figures 6.1, 6.4,
6.5, and 6.6. The very strong structure of the image barbara can
be observed through the different scales.

3 The results in [11] are the best known denoising results at the time of writing this paper. These
go beyond the performance reported in [18, 19, 21, 34], which until recently were the leading ones,
each in its short period of time.

With N > 3, our multiscale scheme proves not to be flexible enough to be used,
since it leads to significant computational cost and optimization problems of the
involved parameters. Further work is required to modify this scheme to allow such
flexibility. Using image pyramids is a topic we are currently considering.

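We implemented a parallel version of the algorithm in C++ using OpenMP for
parallelism and the Intel Math Kernel Library for the matrix computation. On a
recent quad-core Intel Xeon 2.33 GHz, J = 30 iterations for one 200 × 200 block of
the image lena with σ = 15 took approximately 8 seconds for N = 1 scale and 58
seconds for N = 2 scales, using the parameters from the above experiments.4

   σ |    house     |   peppers    |  cameraman   |     lena     |   barbara
-----+--------------+--------------+--------------+--------------+--------------
   5 | 39.37  39.82 | 37.78  38.09 | 37.87  38.26 | 38.60  38.73 | 38.08  38.30
     | 39.81  39.92 | 38.07  38.20 | 38.12  38.32 | 38.72  38.78 | 38.34  38.32
-----+--------------+--------------+--------------+--------------+--------------
  10 | 35.98  36.68 | 34.28  34.68 | 33.73  34.07 | 35.47  35.90 | 34.42  34.96
     | 36.38  36.75 | 34.58  34.62 | 34.01  34.17 | 35.75  35.84 | 34.90  34.86
-----+--------------+--------------+--------------+--------------+--------------
  15 | 34.32  34.97 | 32.22  32.70 | 31.42  31.83 | 33.70  34.27 | 32.37  33.08
     | 34.68  35.00 | 32.53  32.47 | 31.68  31.72 | 34.00  34.14 | 32.82  32.96
-----+--------------+--------------+--------------+--------------+--------------
  20 | 33.20  33.79 | 30.82  31.33 | 29.91  30.42 | 32.38  33.01 | 30.83  31.77
     | 33.51  33.75 | 31.15  31.08 | 30.32  30.37 | 32.68  32.88 | 31.37  31.53
-----+--------------+--------------+--------------+--------------+--------------
  25 | 32.15  32.87 | 29.73  30.19 | 28.85  29.40 | 31.32  32.06 | 29.60  30.65
     | 32.39  32.83 | 30.03  30.04 | 29.28  29.37 | 31.63  31.92 | 30.17  30.29
-----+--------------+--------------+--------------+--------------+--------------
  50 | 27.95  29.45 | 26.13  26.35 | 25.73  25.86 | 27.79  28.86 | 25.47  27.14
     | 28.24  29.40 | 26.34  26.64 | 26.06  26.17 | 28.15  28.80 | 26.08  26.78
-----+--------------+--------------+--------------+--------------+--------------
 100 | 23.71  25.43 | 21.75  22.91 | 21.69  22.62 | 24.46  25.51 | 21.89  23.49
     | 23.83  24.84 | 21.94  22.64 | 22.05  22.84 | 24.49  25.06 | 22.07  22.95

   σ |     boat     |    couple    |     hill     |   Average
-----+--------------+--------------+--------------+--------------
   5 | 37.22  37.28 | 37.31  37.50 | 37.02  37.13 | 37.91  38.14
     | 37.35  37.35 | 37.42  37.54 | 37.11  37.17 | 38.12  38.20
-----+--------------+--------------+--------------+--------------
  10 | 33.64  33.90 | 33.52  34.03 | 33.37  33.60 | 34.30  34.73
     | 33.93  33.98 | 33.84  33.97 | 33.59  33.70 | 34.62  34.74
-----+--------------+--------------+--------------+--------------
  15 | 31.73  32.10 | 31.45  32.10 | 31.47  31.86 | 32.34  32.86
     | 32.04  32.13 | 31.83  31.94 | 31.78  31.88 | 32.67  32.78
-----+--------------+--------------+--------------+--------------
  20 | 30.36  30.85 | 30.00  30.74 | 30.18  30.70 | 30.96  31.57
     | 30.74  30.82 | 30.42  30.59 | 30.53  30.66 | 31.34  31.46
-----+--------------+--------------+--------------+--------------
  25 | 29.28  29.84 | 28.90  29.68 | 29.18  29.82 | 29.88  30.56
     | 29.67  29.82 | 29.31  29.51 | 29.52  29.78 | 30.25  30.45
-----+--------------+--------------+--------------+--------------
  50 | 25.95  26.56 | 25.32  26.32 | 26.27  27.04 | 26.33  27.20
     | 26.36  26.74 | 25.78  26.36 | 26.52  27.04 | 26.69  27.24
-----+--------------+--------------+--------------+--------------
 100 | 22.81  23.64 | 22.60  23.39 | 23.98  24.44 | 22.86  23.93
     | 22.96  23.67 | 22.73  23.16 | 23.92  24.16 | 23.00  23.67

Table 6.1

PSNR results of our denoising algorithm compared with some other ones. Each cell is divided
into four parts. The top-left part shows the results from the original K-SVD [1], the top-right from
the most recent state-of-the-art [11]. The bottom-left is devoted to our results for N = 1 scale and
the bottom-right to N = 2 scales. Each time the best results are in bold.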


4 The code will be made publicly available upon publication.

 N = 1
   σ   |    5     10     15     20     25     50    100
-------+-------------------------------------------------
  √n   |    8      8      8      8      8      8      8
  J    |   30     30     30     30     30     15     15
  μ    |  0.5    0.5    0.5    0.5    0.5    1.0    1.0
  m    |    1      1     64     64     64     64     64
  C    | 1.128  1.128  1.041  1.023  1.023  1.018  1.018
  √S_b |  150    150    200    200    200    512    768

 N = 2
   σ   |    5     10     15     20     25     50    100
-------+-------------------------------------------------
  √n   |   10     12     16     16     16     20     20
  J    |   30     30     30     30     30     15     15
  μ    |  0.5    0.5    0.5    0.5    0.5    1.0    1.0
  m    |    4      4     16     16     16     64     64
  C    | 1.069  1.042  1.026  1.026  1.020  1.010  1.008
  √S_b |  150    200    200    250    400    512    768

Table 6.2

Parameters used for the grayscale denoising experiments presented in Figure 6.2 and Table
6.1. n is the size of the patches. J is the number of learning iterations, μ is the fraction of patches
used during the training, m is the number of adjacent and non-overlapping patches processed at the
same time (see Section 4). C is the parameter from Equation (2.2). The block denoising algorithm
has been applied to √S_b × √S_b blocks when √S_b was smaller than the size of the input image.


(b)  s  =  1 


Fig.  6.1.  A  learned  3-scales  global  dictionary,  which  has  been  trained  over  a  large  database  of 
natural  images. 


   σ |    n    |  house   | peppers  | cameraman | Average
-----+---------+----------+----------+-----------+----------
  10 | 20 × 20 | +0.10dB  | +0.03dB  |  +0.00dB  | +0.04dB
  15 | 20 × 20 | -0.07dB  | +0.18dB  |  +0.28dB  | +0.13dB
  15 | 24 × 24 | +0.02dB  | +0.13dB  |  +0.15dB  | +0.10dB

Table 6.3

PSNR improvements obtained using N = 3 scales for σ = 10 and σ = 15 compared to the case
of N = 2 scales. For N = 3, a dictionary with k_s = 256 for all s = 0, 1, 2, m = 4, and C = 1.018
were used.




(a) Original image boat  (b) Noisy, σ = 50  (c) Result, PSNR=26.74dB

(d) Original image hill  (e) Noisy, σ = 10  (f) Result, PSNR=33.68dB

(g) Original image lena  (i) Result, PSNR=35.85dB

Fig. 6.2. Examples of denoising results for N = 2 scales.


6.2. Color image denoising. In [25], we presented state-of-the-art results for
color image denoising using the previously described modified version of the K-SVD.
These results have recently been slightly surpassed [10]. Here we apply our multiscale
framework and our algorithmic improvements to the color denoising K-SVD to show
that it can compete and again provide state-of-the-art results. As in [25], we use a
data set composed of natural images from the Berkeley Segmentation Database [30],
see Figure 6.7.

Numerical results are presented in Table 6.4 and some visual results in Figure
6.8. All the numbers presented here are averaged over 5 experiments for each image
and each level of noise. The parameters used during the experiments are reported in
Table 6.5. Our experiments indicate that for N = 2, the
parameter γ proves to be useful only for high noise levels (σ ≥ 25).




(a) Original image house  (b) Noisy, σ = 10  (c) N = 1, PSNR=36.36dB  (d) N = 2, PSNR=36.74dB  (e) N = 3, PSNR=36.85dB

(k) Original image cameraman  (l) Noisy, σ = 15  (m) N = 1, PSNR=31.71dB  (n) N = 2, PSNR=31.73dB  (o) N = 3, PSNR=32.01dB

(p) Zoom on (k)  (q) Zoom on (l)  (r) Zoom on (m)  (s) Zoom on (n)  (t) Zoom on (o)

Fig. 6.3. A comparison between N = 1, 2, 3 scales.


As we can see, our model with N = 1 is already close to [10] (-0.03dB on average)
and even slightly better for σ ≤ 5 (+0.05dB). With N = 2 scales, we have an average
improvement of +0.06dB over the single-scale algorithm and +0.04dB over [10]. One
can also note that our color denoising algorithm is much more effective than handling
each R, G, B channel separately, providing a substantial average improvement of
2.65dB on our dataset. For illustrative purposes, some color multiscale dictionaries are
presented in Figure 6.9. Very interestingly, the color information seems to be present
mainly at the coarse scale.

6.3. Image inpainting. Filling in small holes in images was presented in [25]
using the K-SVD algorithm. Here, we show that using more than one scale can lead
to visually impressive results. For illustrative purposes, we show an example obtained
with N = 2 scales in Figure 6.10, compared with N = 1. This result is quite impressive

Fig. 6.4. A learned 3-scales dictionary, which has been trained over a noisy version of the
image barbara, with σ = 15. This image is presented in Figure 5.1. The initial dictionary is a global
one, presented in Figure 6.1.


   σ |    castle    |   mushroom   |    train
-----+--------------+--------------+--------------
   5 | 40.84  38.27 | 40.20  37.65 | 39.91  36.52
     | 40.77  40.79 | 40.26  40.26 | 40.04  40.03
-----+--------------+--------------+--------------
  10 | 36.61  34.25 | 35.94  33.46 | 34.85  31.37
     | 36.51  36.65 | 35.88  35.92 | 34.90  34.93
-----+--------------+--------------+--------------
  15 | 34.39  31.95 | 33.61  31.21 | 31.95  28.53
     | 34.22  34.37 | 33.51  33.58 | 31.98  32.04
-----+--------------+--------------+--------------
  20 | 32.84  30.52 | 31.99  29.74 | 29.97  26.79
     | 32.63  32.77 | 31.86  31.97 | 29.97  30.01
-----+--------------+--------------+--------------
  25 | 31.68  29.47 | 30.84  28.69 | 28.45  26.55
     | 31.45  31.59 | 30.67  30.75 | 28.50  28.53

   σ |    horses    |   kangaroo   |   Average
-----+--------------+--------------+--------------
   5 | 40.46  37.17 | 39.13  35.73 | 40.11  37.07
     | 40.44  40.45 | 39.26  39.25 | 40.15  40.16
-----+--------------+--------------+--------------
  10 | 35.78  32.70 | 34.29  31.20 | 35.49  32.60
     | 35.67  35.75 | 34.31  34.34 | 35.45  35.52
-----+--------------+--------------+--------------
  15 | 33.18  30.48 | 31.63  29.05 | 32.95  30.24
     | 33.11  33.19 | 31.71  31.75 | 32.91  32.99
-----+--------------+--------------+--------------
  20 | 31.44  29.13 | 29.85  27.77 | 31.22  28.79
     | 31.35  31.47 | 29.99  30.07 | 31.16  31.26
-----+--------------+--------------+--------------
  25 | 30.19  28.21 | 28.65  26.90 | 29.96  27.96
     | 30.19  30.28 | 28.82  28.87 | 29.93  30.00

Table 6.4

PSNR results for our color image denoising experiments. Each cell is composed of four parts:
The top-left is devoted to [10], the top-right to our 2-scales gray image denoising method applied
to each R, G, B channel independently, the bottom-left to the color denoising algorithm with N = 1
scale and the bottom-right to our algorithm with N = 2 scales. Each time the best results are in
bold.

(c) N = 2, s = 1

Fig. 6.5. Multiscale dictionaries that have been trained over a noisy version of the image boat,
with σ = 15, N = 1, N = 2 and N = 3.


bearing  in  mind  that  it  is  able  to  retrieve  the  brick  texture  of  the  wall,  something  that 
our  visual  system  is  not  able  to  do.  In  this  example,  the  multiscale  version  provides 
an  improvement  of  2.24 dB  over  the  single-scale  algorithm. 

7.  Applications  to  video  processing.  We  show  now  that  our  framework  can 
be  extended  to  video  processing.  For  illustrative  purposes,  we  choose  to  give  two 
examples:  Color  video  denoising  and  video  inpainting. 

(a) s = 0  (b) s = 1

Fig. 6.6. A learned 2-scales dictionary, which has been trained on a large set of clean patches
from a database of natural images. Compare with Figure 6.1.

Fig. 6.7. Data used for evaluating the color denoising experiments. (This is a color figure.)

7.1. Color video denoising. We now present results obtained by combining
the color extension of the multiscale K-SVD and the video one. Figure 7.1 presents
a result obtained on a sequence of 5 images taken from a classical video sequence,
with added white Gaussian noise of standard deviation σ = 25. In the third column,
we present the results obtained by denoising each frame separately using the
multiscale K-SVD algorithm for color images, with the same parameters as in subsection
6.2. In the last column, we present the output of our multiscale K-SVD algorithm
for denoising color videos, which takes into account the temporal correlation. As we
can see, the multiscale and temporal algorithm can provide both PSNR and visual
improvements. The raw performance difference in terms of PSNR between these two
methods is +1.14dB. Looking carefully at the images, we see fewer artifacts and sharper
details in the last column. In these experiments, we used patches and atoms of size
n = 10 × 10 × 3 × 3 with N = 2 scales. This means that we used three successive
frames to build each patch and dictionaries with three temporal channels. The initial
dictionary is a global one, trained on a large database of videos, with a sparsity factor
L = 20. The parameters γ and η are not used (γ = 0.0 and η = 1.0), but it proved to
be important to introduce constant red, green and blue atoms for each temporal
channel. The parameters m and C are set respectively to 1 and 1.04. J = 30 iterations
are used during the denoising for the first algorithm (which skips proper treatment of
the temporal domain). As we propagate the dictionary, the number of iterations J
during the denoising of the next frames is set to 10. 15 frames of the test video were
processed, but only 5 are shown in Figure 7.1.
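To make the patch geometry concrete, the following sketch builds one spatio-temporal color patch of size 10 × 10 × 3 × 3 from three successive frames. The array layout (frames, rows, columns, channels), the temporal centering on frame t, and the helper name `extract_3d_patch` are assumptions made for illustration.

```python
import numpy as np

def extract_3d_patch(video, t, row, col, size=10, frames=3):
    """Flatten a size x size x frames x 3 patch centered in time on frame t.

    video: array of shape (T, H, W, 3), an assumed layout.
    """
    half = frames // 2
    block = video[t - half:t + half + 1, row:row + size, col:col + size, :]
    if t - half < 0 or block.shape != (frames, size, size, 3):
        raise ValueError("patch falls outside the video volume")
    return block.reshape(-1)

# Usage: a synthetic 5-frame color video of size 20 x 20.
video = np.zeros((5, 20, 20, 3))
patch = extract_3d_patch(video, t=2, row=4, col=7)
print(patch.shape)
```

Each patch is then a vector of dimension 10 · 10 · 3 · 3 = 900, which is the n used for the atoms in this experiment.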

7.2. Video inpainting. Figure 7.2 presents results obtained with the multiscale
K-SVD for video inpainting, and compares them to the results obtained when applying the
K-SVD for image inpainting to each frame separately. As we can observe,

(i) Denoised


Fig. 6.8. Results for color image denoising with 2 scales. For the castle image, the resulting
PSNR is 36.65dB, for the mushroom 30.78dB, and for the horses 30.25dB. (This is a color figure.)

       |               N = 1               |               N = 2
   σ   |    5     10     15     20     25  |    5     10     15     20     25
-------+-----------------------------------+-----------------------------------
  μ    |    …      …      …      …    0.5  |  0.5    0.5    0.5    0.5    0.5
  η    |  2.0    2.0    2.0    2.0    2.0  |  1.5    1.5    1.5    1.5    1.5
  γ    |  0.0   1.25    3.0   5.25   5.25  |  0.0    0.0    0.0    0.0   1.25
  m    |   64     64     64     64     64  |    4     16     64     64     64
  C    | 1.016  1.016  1.014  1.014  1.012 | 1.019  1.01   1.004  1.003  1.003
  √S_b |  300    300    300    300    300  |  300    300    300    300    300

Table 6.5

Parameters used for the color denoising algorithm. n is the size of the patches. J is the number
of learning iterations, μ is the fraction of patches used during the training. η gives a preference to
the constant atom during the OMP. γ enforces the average color of the patches (see [25]). m is the
number of patches processed at the same time (see Section 4). C is the parameter from Equation
(2.2). The block denoising algorithm has been applied to √S_b × √S_b blocks when √S_b was smaller
than the size of the input image.


(c) s = 0

Fig. 6.9. Two learned 2-scales color dictionaries. The top one has been trained over a noisy
version of the image castle, with σ = 10; the initial dictionary was a global one. The bottom
dictionary has been trained on a large set of clean patches from a database of natural images. Since
the atoms can have negative values, the vectors are presented scaled and shifted to the [0, 255] range
per channel. (This is a color figure.)

(a) Original  (b) Damaged  (c) Restored, N = 1  (d) Restored, N = 2

Fig. 6.10. Inpainting using N = 2 and n = 16 × 16 (third image), or N = 1 and n = 8 × 8
(fourth image). J = 100 iterations were performed. During the learning, 50% of the patches were
used. A sparsity factor L = 10 has been used during the learning process and L = 25 for the final
reconstruction. The damaged image was created by removing 75% of the data from the original
image. The initial PSNR is 6.13dB. The resulting PSNR for N = 2 is 33.97dB, and 31.75dB for
N = 1.


taking into account the temporal behavior makes it possible to achieve better results in terms
of PSNR and visual quality. The parameters used when we applied the K-SVD for
images on each frame separately were the same as in the experiments in Figure 6.10,
with J = 60. For the multiscale K-SVD for video inpainting algorithm, we used
patches and atoms of size n = 10 × 10 × 5 with N = 2 scales (with 5 temporal
channels). The initial dictionary is a global one, trained on a large database of videos,
with a sparsity factor L = 20. The parameter η is set to 2.0. J = 30 iterations are
used during the processing of the first multi-frame, and then only 10. We present 5
out of the 15 processed frames.

8. Conclusion and future directions. In this paper we presented a K-SVD
based algorithm that is able to learn multiscale sparse image representations. Using
a shift-invariant sparsity prior on natural images, the proposed framework achieves
state-of-the-art image restoration results. We have shown that this framework can be
adapted to video processing, exploiting temporal information. All of the experiments
reported in this paper can be reproduced with C++ software, which will be freely
available on the authors' webpage. Our current efforts are devoted in part to the design
of faster algorithms, which can be used with any number of scales. One direction we
are pursuing is to combine the K-SVD with image pyramids. Results along this
direction will hopefully be reported soon.

At a more general level, we ask ourselves whether we are reaching the performance
limit for many image and video enhancement tasks, such as the image denoising and
demosaicing results presented here and in [25]. Understanding these limits is critical
to evaluating the importance of future efforts in these challenging problems.

Acknowledgments. We would like to thank the authors of [10, 11] for providing
very efficient and intuitive implementations of the BM3D and CBM3D algorithms.

(a) Original  (b) Damaged  (c) Image Denoising  (d) Video Denoising

(e) Zoom on (a)  (f) Zoom on (b)  (g) Zoom on (c)  (h) Zoom on (d)

Fig. 7.1. Results obtained with the proposed multiscale K-SVD for video denoising. From left
to right: 5 frames of an original video, the same frames with Gaussian additive noise (σ = 25), the
results obtained when applying the color image denoising algorithm working on each frame separately
(PSNR: 27.14dB), and the result of the proposed color video denoising multiscale K-SVD (PSNR:
28.28dB). The last row presents a zoomed version of one part of the last frame. (This is a color
figure.)

(a) Original  (b) Damaged  (c) Image Inpainting  (d) Video Inpainting


Fig.  7.2.  Results  obtained  with  the  proposed  multiscale  K-SVD  for  video  inpainting.  From  left 
to  right:  5  frames  of  a  video  are  shown,  the  same  sequence  with  80%  of  data  missing,  the  results 
obtained  when  applying  the  image  inpainting  algorithm  to  each  frame  separately  (PSNR:  2f.38dB), 
and  the  result  of  the  new  video  inpainting  K-SVD  (PSNR:  28.f9dB).  The  last  row  presents  a  zoomed 
version  of  one  part  of  the  last  frame. 


